Fuzzy c-means clustering is widely used to identify cluster structures in high-dimensional data sets, such as those obtained in DNA microarray and quantitative proteomics experiments. One of its main limitations is the lack of a computationally fast method to determine its two parameters, the fuzzifier and the cluster number. Wrong parameter values may either lead to the inclusion of purely random fluctuations in the results or ignore potentially important data. The optimal solution has parameter values for which the clustering does not yield any results for a purely random data set but which detects cluster formation with maximum resolution on the edge of randomness. Estimation of the optimal parameter values is achieved by evaluating the results of the clustering procedure applied to randomized data sets. In this case, the optimal value of the fuzzifier follows common rules that depend only on the main properties of the data set. Taking the dimension of the set and the number of objects as input values instead of evaluating the entire data set allows us to propose a functional relationship determining its value directly. This result speaks strongly against setting the fuzzifier equal to 2, as is typically done in many previous studies. Validation indices are generally used for the estimation of the optimal number of clusters. A comparison shows that the minimum distance between the centroids provides results that are equivalent to or better than those obtained by other, computationally more expensive indices.
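
As an illustration of the last point, the minimum centroid distance can be sketched in a few lines. This is a minimal sketch and not the authors' implementation: the function name `min_centroid_distance`, the use of the squared Euclidean distance, and the example centroids are all assumptions made here for illustration.

```python
from itertools import combinations


def min_centroid_distance(centroids):
    """Return the smallest squared Euclidean distance between any two centroids.

    `centroids` is a sequence of equal-length coordinate tuples, e.g. the
    cluster centres returned by a fuzzy c-means run (assumed input format).
    """
    if len(centroids) < 2:
        raise ValueError("need at least two centroids")
    return min(
        sum((a - b) ** 2 for a, b in zip(u, v))
        for u, v in combinations(centroids, 2)
    )


# Hypothetical centroids from two runs with different cluster numbers.
well_separated = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0)]
over_split = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0), (6.0, 1.0)]

print(min_centroid_distance(well_separated))
print(min_centroid_distance(over_split))
```

Under this reading, the index is computed once per candidate cluster number, and a sharp drop when one more cluster is added suggests that an existing cluster has been split. The cost depends only on the number of centroids, not on the number of objects, which is one plausible reason such an index would be cheaper than indices that revisit every data point.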